The R environment always points to a certain directory on our computer, which is known as the working directory.
We can get the current working directory with getwd
Type that in a new chunk. What is your current working directory?
The R environment always points to a certain directory on our computer, which is known as the working directory.
We can get the current working directory with getwd
Type that in a new chunk. What is your current working directory?
If your working directory is not week 3, create a new folder called “week3”
Then set the directory using setwd
To know the directory on a MacOS, you go to the week3 folder and click on “Get Info”
Select what is in the “Where” section
Select what is in the “Where” section and copy it using “Cmd + C”
In this case, the working directory will be:
[1] "/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/big_data"
We now simply need to add week3 to this path
Our final path will be:
[1] "/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/big_data/week3"
To set this directory, we type in:
To do the same on Windows, you go to the week3 folder and right-click on “Properties”
To do the same on Windows, you go to the week3 folder and right-click on “Properties”
To do the same on Windows, you go to the week3 folder and right-click on “Properties”
To do the same on Windows, you go to the week3 folder and right-click on “Properties”
In this case, the working directory will be:
\\Mac\Dropbox-1\john_cabot\teaching\big_data
We first need to change all the backslashes to forward slashes:
[1] "/Mac/Dropbox-1/john_cabot/teaching/big_data"
We now simply need to add week3 to this path
Our final path will be:
[1] "/Mac/Dropbox-1/john_cabot/teaching/big_data/week3"
To set this directory, we type in:
In your case, this will look something like:
The easiest datasets to work with are csv files
R is also capabale of reading a variety of other dataset formats including: .dta, .sas, .xlsx, xls, txt, etc.
The type of library used to read these files will have implications for quickly your computer can read the data
Let us look at some examples
read_csvread.csvread_csvlibrary("readr")
#This is to set the directory
setwd("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/big_data/week3/")
# Reading data
data_1851_obs10000<-read_csv("./data/data_1851_obs10000.csv")
#Recording how long it takes
#Step1:Recoding your system's time
start_time <- Sys.time()
#Step2:Loding the data
data_1851_obs10000<-read_csv("./data/data_1851_obs10000.csv")
#Step3: Recording when it finishes
end_time <- Sys.time()
#Step4: Calculating the difference
time_taken_a <- end_time - start_time
#Step5: Priting the difference
time_taken_aTime difference of 0.1398301 secs
read.csv# Reading data
#Step1:Recoding your system's time
start_time <- Sys.time()
#Step2:Loding the data
data_1851_obs10000_b<-read.csv("./data/data_1851_obs10000.csv")
#Step3: Recording when it finishes
end_time <- Sys.time()
#Step4: Calculating the difference
time_taken_b <- end_time - start_time
#Step5: Priting the difference
time_taken_bTime difference of 0.2031388 secs
read.csvread_csvread.csv or read_csv makes a difference#Reading CSV files
csv_file<-read.csv("./data/data_1851_obs10000.csv")
#Reading a Stata file
#Library for reading Stata files
library("haven")
#Library for reading SPSS files
library("foreign")
stata_file<-read_dta("./data/data_1851_obs10000.dta")
spss_file<-read.spss("./data/data_1851_obs10000.sav", to.data.frame=TRUE)
#Library for reading Excel file
library("readxl")
excel_file<-read_excel("./data/data_1851_obs10000.xlsx")You can easily write CSV files to your hard drive using the readr library
Tidy datasets are all alike
Tidy datasets are all alike
Tidy datasets are all alike
Messy data can be messy in their own way.
Messy data can be messy in their own way.
Messy data can be messy in their own way.
Messy data can be messy in their own way.
Messy data can be messy in their own way.
You can become friends with tidy data by following one of the following strategies:
1999 and 2000yearcasesThis is what that looks like in code
Original
Fix
#table4a
#> # A tibble: 12 × 4
#> country year type count
#> <chr> <int> <chr> <int>
#> 1 Afghanistan 1999 cases 745
#> 2 Afghanistan 1999 population 19987071
#> 3 Afghanistan 2000 cases 2666
#> 4 Afghanistan 2000 population 20595360
#> 5 Brazil 1999 cases 37737
#> 6 Brazil 1999 population 172006362
#> # … with 6 more rowsThis is what that looks like in code
Original
#table4a
#> # A tibble: 12 × 4
#> country year type count
#> <chr> <int> <chr> <int>
#> 1 Afghanistan 1999 cases 745
#> 2 Afghanistan 1999 population 19987071
#> 3 Afghanistan 2000 cases 2666
#> 4 Afghanistan 2000 population 20595360
#> 5 Brazil 1999 cases 37737
#> 6 Brazil 1999 population 172006362
#> # … with 6 more rowsFix
#table4a %>%
#> pivot_wider(names_from = type,
#> values_from = count)
#> # A tibble: 6 × 4
#> country year cases population
#> <chr> <int> <int> <int>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583We sometimes may have columns that contain data which should separated in multiple columns
This where we use separate()
This is what that looks like in code
Original
Fix
#table3 %>%
# separate(rate, into = c("cases", "population"))
#> # A tibble: 6 × 4
#> country year cases population
#> <chr> <int> <chr> <chr>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583Uniting is the inverse of separating
It combines multiple columns into a single column
This is what that looks like in code
Original
Fix
#table5 %>%
# unite(new, century, year, sep = "")
#> # A tibble: 6 × 3
#> country new rate
#> <chr> <chr> <chr>
#> 1 Afghanistan 1999 745/19987071
#> 2 Afghanistan 2000 2666/20595360
#> 3 Brazil 1999 37737/172006362
#> 4 Brazil 2000 80488/174504898
#> 5 China 1999 212258/1272915272
#> 6 China 2000 213766/1280428583Here is an example:
The return for the fourth quarter of 2015 is explicitly missing, because the cell where its value should be instead contains NA
The return for the first quarter of 2016 is implicitly missing, because it simply does not appear in the dataset.
To deal with them, we can use values_drop_na = TRUE
library("tibble")
library("tidyr")
stocks %>%
pivot_wider(names_from = year, values_from = return) %>%
pivot_longer(
cols = c(`2015`, `2016`),
names_to = "year",
values_to = "return",
values_drop_na = TRUE
)# A tibble: 6 × 3
qtr year return
<dbl> <chr> <dbl>
1 1 2015 1.88
2 2 2015 0.59
3 2 2016 0.92
4 3 2015 0.35
5 3 2016 0.17
6 4 2016 2.66
Notice the use of values_drop_na = TRUE
Notice that the 4th quarter of 2015 is missing now
We can also use complete()
# A tibble: 8 × 3
year qtr return
<dbl> <dbl> <dbl>
1 2015 1 1.88
2 2015 2 0.59
3 2015 3 0.35
4 2015 4 NA
5 2016 1 NA
6 2016 2 0.92
7 2016 3 0.17
8 2016 4 2.66
complete() takes a set of columns, and finds all unique combinations
It then ensures the original dataset contains all those values, filling in explicit NAs where necessary.
Sometimes the NA is not random
Missing values could indicate that the previous value should be carried forward:
Example:
We can deal with this problem by using fill()
Original
In the real world, people typically have to work with many tables
Multiple tables of data that need to be combined together, are called relational data
We will first load two libraries
There are five databases (tibbles) inside nycflights13:
airlinesairportsplanesweatherflightsairlines tibble# A tibble: 16 × 2
carrier name
<chr> <chr>
1 9E Endeavor Air Inc.
2 AA American Airlines Inc.
3 AS Alaska Airlines Inc.
4 B6 JetBlue Airways
5 DL Delta Air Lines Inc.
6 EV ExpressJet Airlines Inc.
7 F9 Frontier Airlines Inc.
8 FL AirTran Airways Corporation
9 HA Hawaiian Airlines Inc.
10 MQ Envoy Air
11 OO SkyWest Airlines Inc.
12 UA United Air Lines Inc.
13 US US Airways Inc.
14 VX Virgin America
15 WN Southwest Airlines Co.
16 YV Mesa Airlines Inc.
airports tibble# A tibble: 1,458 × 8
faa name lat lon alt tz dst tzone
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/…
2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A America/…
3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/…
4 06N Randall Airport 41.4 -74.4 523 -5 A America/…
5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/…
6 0A9 Elizabethton Municipal Airport 36.4 -82.2 1593 -5 A America/…
7 0G6 Williams County Airport 41.5 -84.5 730 -5 A America/…
8 0G7 Finger Lakes Regional Airport 42.9 -76.8 492 -5 A America/…
9 0P2 Shoestring Aviation Airfield 39.8 -76.6 1000 -5 U America/…
10 0S9 Jefferson County Intl 48.1 -123. 108 -8 A America/…
# ℹ 1,448 more rows
weather tibble# A tibble: 26,115 × 15
origin year month day hour temp dewp humid wind_dir wind_speed
<chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4
2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06
3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5
4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7
5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7
6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5
7 EWR 2013 1 1 7 39.0 28.0 64.4 240 15.0
8 EWR 2013 1 1 8 39.9 28.0 62.2 250 10.4
9 EWR 2013 1 1 9 39.9 28.0 62.2 260 15.0
10 EWR 2013 1 1 10 41 28.0 59.6 260 13.8
# ℹ 26,105 more rows
# ℹ 5 more variables: wind_gust <dbl>, precip <dbl>, pressure <dbl>,
# visib <dbl>, time_hour <dttm>
planes tibble# A tibble: 3,322 × 9
tailnum year type manufacturer model engines seats speed engine
<chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
1 N10156 2004 Fixed wing multi… EMBRAER EMB-… 2 55 NA Turbo…
2 N102UW 1998 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
3 N103US 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
4 N104UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
5 N10575 2002 Fixed wing multi… EMBRAER EMB-… 2 55 NA Turbo…
6 N105UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
7 N107US 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
8 N108UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
9 N109UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
10 N110UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
# ℹ 3,312 more rows
flights tibble# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
This is how the five tibbles could be connected to one another
flights connects to planes via a single variable, tailnumflights connects to planes via a single variable, tailnumflights connects to airlines through the carrier variableflights connects to airports through: the origin and dest variablesflights connects to airports through: the origin and dest variablesflights connects to weather via origin (the location), and year, month, day and hour (the time)The variables used to connect the tibbles are keys which are used to uniquely identify the observations
Sometimes one variable uniquely identifies the observation - tainum for planes
Other times, more than one variable is necessary - year, month, day, hour, and origin.
It is always good practice to make sure that the keys identified are in fact unique
This can be done this way for planes:
# A tibble: 0 × 2
# ℹ 2 variables: tailnum <chr>, n <int>
tailnum uniquely identifies planesflights in the following way:# A tibble: 29,768 × 5
year month day flight n
<int> <int> <int> <int> <int>
1 2013 1 1 1 2
2 2013 1 1 3 2
3 2013 1 1 4 2
4 2013 1 1 11 3
5 2013 1 1 15 2
6 2013 1 1 21 2
7 2013 1 1 27 4
8 2013 1 1 31 2
9 2013 1 1 32 2
10 2013 1 1 35 2
# ℹ 29,758 more rows
flights, it seems that the combination of year, month, day, and flight is not enough to uniquely identify observationsx.yx and yThe most common type of join that you will need in practive is a left join
This allows you to preserve the original observation even when there is not match
It is important to consider what happens when we have duplicate keys
Case 1: One Table has duplicated keys
It is important to consider what happens when we have duplicate keys
Case 2: Both Tables have duplicated keys
When you join duplicated keys, you get all possible combinations, the Cartesian product
flights connects to planes via a single variable, tailnumStep1: Selecting only relevant variables
Step2: Performing the Left Join
# A tibble: 336,776 × 14
day hour origin dest tailnum carrier year type manufacturer model
<int> <dbl> <chr> <chr> <chr> <chr> <int> <chr> <chr> <chr>
1 1 5 EWR IAH N14228 UA 1999 Fixed wing… BOEING 737-…
2 1 5 LGA IAH N24211 UA 1998 Fixed wing… BOEING 737-…
3 1 5 JFK MIA N619AA AA 1990 Fixed wing… BOEING 757-…
4 1 5 JFK BQN N804JB B6 2012 Fixed wing… AIRBUS A320…
5 1 6 LGA ATL N668DN DL 1991 Fixed wing… BOEING 757-…
6 1 5 EWR ORD N39463 UA 2012 Fixed wing… BOEING 737-…
7 1 6 EWR FLL N516JB B6 2000 Fixed wing… AIRBUS INDU… A320…
8 1 6 LGA IAD N829AS EV 1998 Fixed wing… CANADAIR CL-6…
9 1 6 JFK MCO N593JB B6 2004 Fixed wing… AIRBUS A320…
10 1 6 LGA ORD N3ALAA AA NA <NA> <NA> <NA>
# ℹ 336,766 more rows
# ℹ 4 more variables: engines <int>, seats <int>, speed <int>, engine <chr>
In the previous example, we looked at join by tailum
By we can also perform join by multiple variables
# A tibble: 336,776 × 29
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 336,766 more rows
# ℹ 21 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour.x <dttm>, temp <dbl>, dewp <dbl>,
# humid <dbl>, wind_dir <dbl>, wind_speed <dbl>, wind_gust <dbl>,
# precip <dbl>, pressure <dbl>, visib <dbl>, time_hour.y <dttm>
Filtering joins match observations in the same way as mutating joins
They affect observations, not variables
semi_join(x, y) keeps all observations in x that have a match in yanti_join(x, y) drops all observations in x that have a match in yOnly the existence of a match is important
It doesn’t matter which observation is matched
It is the inverse of the semi-join
The anti-join keeps the rows that don’t have a match
They are useful for diagnosing join mismatches
When working spatial data, it is often necessary to work with time dates and time series
In other words, spatial data may contain temporal information
Some common used classes to repsent time in R include:
Times:
Date - To represent data at day-by-day levelPOSIXct and POSIXlt to represent data at a second-by-second levelTime Series:
tszoo (with the zoo package)xts (with the xts package)The simplest data structure to represent time is Date
For example, we can turn an object from character to Date by using as.Date function
The character values are in the standard format known as ISO 8601 (YYYY-MM-DD)
The as.Date fuction works without additional arguments
When the character values are in a non-standard format, we need to specify the format definition
The following table lists the most commonly used symbols for specifying date in R.
| Symbol | Example | Meaning |
|---|---|---|
%d |
“15” | Day |
%m |
“08” | Month, Numeric |
%b |
“Aug” | Month, 3-letter |
%B |
“August” | Month, full |
%y |
14 | Year, 2-digit |
%Y |
2014 | Year, 4-digit |
For example, we need to specify the format for certain date types.
To fix the error, we need to type:
For example, we need to specify the format for certain date types.
To fix the error, we need to type:
The following will not work
This is because the first two are Date objects while the last two are character objects
Date objects act like numeric objects
We can thus perform:
Another example:
If we look at time_diff1 and time_diff2, they are different classes
We can turn them to numeric classes and to specific units.
We can also create a sequence of dates using seq.
For example, the following expression creates a sequence of dates.
This is how we can subset our date vector.
The diff function can be used to create a vector of differences between consecutive elements
For example:
or
Note that the length of diff(x) is one element less that x
We don’t have the difference for the first (or last) element, depending how you look at it
Popescu (JCU): Lecture 3